[Test] Add some gsm8k configs for hybrid models. by tdoublep · Pull Request #35406 · vllm-project/vllm

tdoublep · 2026-02-26T15:20:19Z

Purpose

This PR adds three configs to the gsm8k testing framework that are helpful for development on hybrid models. I found them very useful for debugging something I'm working on right now related to MTP + prefix caching + async scheduling.

Test Plan

They can be run with:

pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py --config-list-file=tests/evals/gsm8k/configs/hybrid/models-h100.txt -k 'Qwen3-Next-FP8-TP4-MTP-Align'

Test Result

GSM8K Results for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8:
  Measured metric: 0.8264
  Expected metric: 0.8500
  Tolerance: 0.0800
  Questions: 1319
  Invalid rate: 0.000
  Latency: 78.5s
  QPS: 16.8
✅ GSM8K test passed for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
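As a rough illustration of why this run passes, the reported numbers can be checked by hand. This is a minimal sketch that assumes the test applies a one-sided lower bound (measured accuracy must be no more than the tolerance below the expected value); the actual assertion in test_gsm8k_correctness.py is not shown in this PR, and the helper name `gsm8k_passes` is hypothetical.

```python
def gsm8k_passes(measured: float, expected: float, tolerance: float) -> bool:
    """Hypothetical helper: pass when accuracy is at most `tolerance` below expected.

    Assumption: the real test uses a one-sided lower bound like this one.
    """
    return measured >= expected - tolerance


# Values copied from the test result above.
measured, expected, tolerance = 0.8264, 0.8500, 0.0800
questions, latency_s = 1319, 78.5

print(f"lower bound: {expected - tolerance:.4f}")
print(f"passes: {gsm8k_passes(measured, expected, tolerance)}")
# QPS is consistent with questions divided by latency.
print(f"qps: {questions / latency_s:.1f}")
```

Under that assumption the run clears the bound with some margin: the measured accuracy is about 0.024 below the expected value, against an allowed 0.08.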

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

gemini-code-assist

Code Review

This pull request adds several test configurations for gsm8k evaluation of hybrid models, specifically for Qwen3Next. My review found a critical issue in one of the new configuration files. The configuration for Qwen3-Next-FP8-TP4-MTP-Align enables prefix caching, which is not supported for hybrid models like Qwen3Next and will cause the engine to fail. This should be removed.

gemini-code-assist · 2026-02-26T15:22:20Z

+  --max-model-len 4096
+  --tensor-parallel-size 4
+  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
+  --enable-prefix-caching


The configuration enables prefix caching (--enable-prefix-caching) for the Qwen3Next model. Hybrid models like Qwen3Next do not support prefix caching, which will cause a ValueError during engine initialization. Please remove this argument.

robertgshaw2-redhat · 2026-02-26T15:31:23Z

@@ -0,0 +1,9 @@
+model_name: "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"


Can we just have this one? Or is it also useful to test the non-spec-decoding / non-prefix-caching case?

tdoublep · 2026-02-26

It's useful, yeah. I put the 3 configs in a "hybrid" folder so as not to pollute what's there.

Alternatively, if we want to keep the number of configs to a minimum, maybe it could be useful to be able to pass additional overrides when passing the configs to pytest (if that isn't possible already).
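One way the override idea could work is sketched below. Everything here is hypothetical: `merge_server_args` is not an existing helper in the gsm8k framework, and the toy parser only stands in for the real server argument parser. The sketch relies on standard argparse behaviour, where the last occurrence of an option wins, so overrides appended after the base arguments take effect.

```python
import argparse
import shlex


def merge_server_args(base: str, overrides: str) -> list[str]:
    """Hypothetical helper: append override tokens after the base config's tokens."""
    return shlex.split(base) + shlex.split(overrides)


# Toy parser standing in for the real one (an assumption, not vLLM's parser).
parser = argparse.ArgumentParser()
parser.add_argument("--max-model-len", type=int)
parser.add_argument("--tensor-parallel-size", type=int)
parser.add_argument("--speculative-config", type=str)
parser.add_argument(
    "--enable-prefix-caching",
    action=argparse.BooleanOptionalAction,
    default=False,
)

# Base arguments taken from the config fragment quoted in the review.
base = (
    "--max-model-len 4096 --tensor-parallel-size 4 "
    """--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' """
    "--enable-prefix-caching"
)

# Overrides that would turn one config into the non-prefix-caching variant.
merged = merge_server_args(base, "--no-enable-prefix-caching --max-model-len 2048")
args = parser.parse_args(merged)

print(args.max_model_len, args.tensor_parallel_size, args.enable_prefix_caching)
```

With something like this, a single base config could cover the prefix-caching and non-prefix-caching cases instead of keeping one file per combination.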

github-actions · 2026-05-28T02:17:13Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

Add some gsm8k configs for hybrid models.

9064d55

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

tdoublep requested a review from mgoin as a code owner February 26, 2026 15:20

gemini-code-assist Bot reviewed Feb 26, 2026

robertgshaw2-redhat reviewed Feb 26, 2026

tdoublep mentioned this pull request Feb 26, 2026

[Perf] [Hybrid] Copy num_accepted_tokens in non-blocking way when not using prefix caching #35442

Merged

5 tasks

fuscof-ibm mentioned this pull request Mar 27, 2026

[Hybrid] Simplify accepted token counting in spec decode for hybrid models #38372

Merged

5 tasks

fuscof-ibm mentioned this pull request Apr 17, 2026

[Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing #40172

Merged

github-actions Bot added the stale label (Over 90 days of inactivity) May 28, 2026

tdoublep wants to merge 1 commit into vllm-project:main from tdoublep:tpa-hybrid-eval
